I wanted to investigate trends in Irish road safety using public collision data from 2005 - 2020.
Questions:
Is there a month with a significant difference in collisions?
Is there a trend in collisions on Irish roads from 2005 to 2020? What predictions can we make?
Is there a County in Ireland with a notably higher or lower number of fatal collisions per capita?
Data Source
Data were downloaded from Ireland's open data portal, data.gov.ie
2005 - 2020 collision aggregate: https://data.gov.ie/dataset/roa17-traffic-collisions-and-casualties/resource/06ab21a7-92e5-4506-addc-f8e2486a8dfc
2013 - 2020 collision by County: https://data.gov.ie/dataset/roa27-traffic-collisions-and-casualities
County population data 2016: https://data.gov.ie/dataset/population-classified-by-area
Conclusion
There appears to be a downward trend in collisions on Irish roads between 2005 and 2020. However, the linear regression on monthly injury collisions has a low R-squared value of 0.05, meaning that year alone explains only about 5% of the variation in the data, so the model has limited predictive power.
Further analysis and the inclusion of additional variables are recommended for a more accurate prediction of collision rates.
From 2013 to 2020, Longford had the highest total number of fatal collisions per capita on Irish roads, while Dublin had the lowest.
Notes
Categories of 'Statistic Label' describe the outcome of a collision: 'Fatal Collisions', 'Injury Collisions', 'All Fatal and Injury Collisions', 'Killed Casualties', 'Injured Casualties' and 'All Killed and Injured Casualties'.
# import libraries
import pandas as pd # pandas for panel data
import matplotlib.pyplot as plt # pyplot for visualization
import seaborn as sns # seaborn for visualization
from sklearn.linear_model import LinearRegression # linear regression model for prediction of collisions
import plotly.express as px # plotly for interactive plots
import numpy as np # import numpy to make jitter for regression plot
# load the data
# Irish road collision data as monthly aggregates from 2005 - 2020
df = pd.read_csv("../data/ROA17.20230921155733.csv")
# Irish road collision data per County from 2013 to 2020
df_county_collision = pd.read_csv("../data/ROA27.20230924131436.csv")
# Irish County population data from 2016
df_county_pop = pd.read_csv("../data/county_population_2016.csv")
# Display the first few rows
df.head()
| | STATISTIC | Statistic Label | TLIST(A1) | Year | C01885V02316 | Month of Year | UNIT | VALUE |
|---|---|---|---|---|---|---|---|---|
| 0 | ROA17C1 | Fatal Collisions | 2005 | 2005 | - | All months | Number | 360.0 |
| 1 | ROA17C1 | Fatal Collisions | 2005 | 2005 | 01 | January | Number | 31.0 |
| 2 | ROA17C1 | Fatal Collisions | 2005 | 2005 | 02 | February | Number | 34.0 |
| 3 | ROA17C1 | Fatal Collisions | 2005 | 2005 | 03 | March | Number | 23.0 |
| 4 | ROA17C1 | Fatal Collisions | 2005 | 2005 | 04 | April | Number | 20.0 |
# Summary statistics - how are the values distributed here
# List of columns to include in the description (excluding 'Year')
columns_to_include = [col for col in df.columns if col != 'Year']
# Use describe() on the selected columns
df[columns_to_include].describe()
| | TLIST(A1) | VALUE |
|---|---|---|
| count | 1248.00000 | 1247.000000 |
| mean | 2012.50000 | 719.630313 |
| std | 4.61162 | 1504.400230 |
| min | 2005.00000 | 4.000000 |
| 25% | 2008.75000 | 24.000000 |
| 50% | 2012.50000 | 478.000000 |
| 75% | 2016.25000 | 642.000000 |
| max | 2020.00000 | 10037.000000 |
# Data information, data types, only missing data is single NULL in VALUE
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1248 entries, 0 to 1247
Data columns (total 8 columns):
 #   Column           Non-Null Count  Dtype
---  ------           --------------  -----
 0   STATISTIC        1248 non-null   object
 1   Statistic Label  1248 non-null   object
 2   TLIST(A1)        1248 non-null   int64
 3   Year             1248 non-null   int64
 4   C01885V02316     1248 non-null   object
 5   Month of Year    1248 non-null   object
 6   UNIT             1248 non-null   object
 7   VALUE            1247 non-null   float64
dtypes: float64(1), int64(2), object(5)
memory usage: 78.1+ KB
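To see which row holds the single missing VALUE, a boolean mask on `isna()` is usually enough. The sketch below uses a hypothetical toy frame (the values are made up, not taken from the ROA17 file) and shows one possible way to drop such rows before modelling, since scikit-learn estimators generally raise an error when the target contains NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing VALUE, mimicking the single null reported by df.info()
toy = pd.DataFrame({
    "Year": [2005, 2005, 2006],
    "Month of Year": ["January", "February", "January"],
    "VALUE": [31.0, np.nan, 28.0],
})

# Select the rows where VALUE is missing
missing_rows = toy[toy["VALUE"].isna()]
print(missing_rows)

# One option: drop those rows before fitting a model
clean = toy.dropna(subset=["VALUE"])
print(len(clean))
```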
# What kind of statistics are being labelled
df['Statistic Label'].unique()
array(['Fatal Collisions', 'Injury Collisions',
'All Fatal and Injury Collisions', 'Killed Casualties',
'Injured Casualties', 'All Killed and Injured Casualties'],
dtype=object)
# Create a copy of the original DataFrame and filter out the 'All months' rows to make plotting and merging easier
filtered_data = df.copy()
# Exclude rows where 'Month of Year' is 'All months' for plotting
filtered_data = filtered_data[filtered_data['Month of Year'] != 'All months']
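The reason for dropping the 'All months' rows is easiest to see on a toy example. The sketch below uses made-up numbers (not from the ROA17 file): leaving the annual aggregate in place inflates any per-month average, while the same filter as above restores the monthly mean.

```python
import pandas as pd

# Hypothetical toy data: one annual aggregate row plus three monthly rows
toy = pd.DataFrame({
    "Month of Year": ["All months", "January", "February", "March"],
    "VALUE": [60.0, 10.0, 20.0, 30.0],
})

# The mean over every row is inflated by the aggregate row
mean_with_aggregate = toy["VALUE"].mean()

# Same filter as in the notebook: keep only the monthly rows
monthly_only = toy[toy["Month of Year"] != "All months"]
mean_monthly = monthly_only["VALUE"].mean()

print(mean_with_aggregate, mean_monthly)
```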
# generate a plot looking at collision data at different months of the year
# Create a FacetGrid with 'Statistic Label' as the categorical variable
g = sns.FacetGrid(filtered_data, col='Statistic Label', col_wrap=3, height=4, sharey= False)
# Ensure correct order of months for the x-axis
month_order = ['January', 'February', 'March', 'April', 'May', 'June', 'July','August','September', 'October','November', 'December']
# use a palette for month colours
month_colors = sns.color_palette("husl", n_colors=len(month_order))
# Create bar plots in each grid
g.map(sns.barplot, 'Month of Year', 'VALUE' ,order = month_order, palette = month_colors)
g.set_axis_labels('Month', 'Collision Value')
g.set_titles(col_template='{col_name}')
# Adjust the layout
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Monthly Statistics by Statistic Label', fontsize=16)
g.set_xticklabels(rotation = 90)
plt.tight_layout()
plt.show()
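The bar plots give a visual impression of monthly differences, but they do not test whether those differences are significant. One possible approach, sketched here on synthetic data, is a Kruskal-Wallis test across the twelve months using `scipy.stats.kruskal`. The synthetic counts, the deliberately low December, and the choice of test are all assumptions for illustration; this is not an analysis of the real ROA17 data.

```python
import numpy as np
import pandas as pd
from scipy.stats import kruskal

rng = np.random.default_rng(0)
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]

# Synthetic data: 16 years of monthly counts around 500, with December made lower on purpose
rows = []
for month in months:
    centre = 300 if month == "December" else 500
    for year in range(2005, 2021):
        rows.append({"Year": year, "Month of Year": month,
                     "VALUE": rng.normal(centre, 20)})
synthetic = pd.DataFrame(rows)

# One array of values per month, then test whether all months share the same distribution
groups = [grp["VALUE"].to_numpy() for _, grp in synthetic.groupby("Month of Year")]
statistic, p_value = kruskal(*groups)
print(f"H = {statistic:.1f}, p = {p_value:.3g}")
```

A small p-value only says that at least one month differs; identifying which one would need a follow-up comparison.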
# generate a plot looking at different collision statistics from 2005 - 2020
# Create a FacetGrid with 'Statistic Label' as the categorical variable
g = sns.FacetGrid(filtered_data, col='Statistic Label', col_wrap=3, height=4, sharey= False)
# Create line plots in each grid
g.map(sns.lineplot, 'Year', 'VALUE')
g.set_axis_labels('Year', 'Collision Value')
g.set_titles(col_template='{col_name}')
# Adjust the layout
plt.subplots_adjust(top=0.9)
g.fig.suptitle('Yearly Statistics by Statistic Label', fontsize=16)
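# label only every fifth year on the x-axis
selected_years = [2005, 2010, 2015, 2020]
# Apply the ticks and rotation to every panel (plt.xticks only acts on the last axes)
for ax in g.axes.flat:
    ax.set_xticks(selected_years)
    ax.tick_params(axis='x', labelrotation=90)  # Rotate x-axis labels for better visibility
# Show the plot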
plt.tight_layout()
plt.show()
# Can use a linear regression model to predict how many 'Injury collisions' there will be in 2025.
# Get X and Y variables
injury_collision_df = filtered_data[filtered_data['Statistic Label'] == 'Injury Collisions']
# Create a linear regression model
regressor = LinearRegression()
# Fit the model to your data
X = injury_collision_df[['Year']]
y = injury_collision_df['VALUE']
regressor.fit(X, y)
# Check the R-squared value
r_squared = regressor.score(X, y)
print(f"R-squared value: {r_squared:.4f}")
R-squared value: 0.0528
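To make the meaning of this number concrete: R-squared is 1 - SS_res / SS_tot, the share of the variance in the target that the fitted line accounts for. The sketch below computes it by hand with NumPy on synthetic data with a weak downward trend and large month-to-month noise. The slope and noise level are assumptions chosen for illustration, not estimates from the collision data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for monthly counts: 16 years x 12 months, weak trend plus large noise
years_since_2005 = np.repeat(np.arange(16), 12).astype(float)
values = 500 - 3 * years_since_2005 + rng.normal(0, 60, size=years_since_2005.size)

# Ordinary least squares line with a single feature
slope, intercept = np.polyfit(years_since_2005, values, deg=1)
fitted = slope * years_since_2005 + intercept

# R-squared = 1 - residual sum of squares / total sum of squares
ss_res = np.sum((values - fitted) ** 2)
ss_tot = np.sum((values - values.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot
print(f"slope = {slope:.2f}, R-squared = {r_squared:.4f}")
```

A fitted slope can be clearly negative while R-squared stays low, which is the situation described in the conclusion: the trend exists, but most of the variation comes from other sources.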
# Predict with a DataFrame that has the same 'Year' column the model was fitted on;
# this avoids the feature-name warning without silencing all warnings
future_years = pd.DataFrame({'Year': [2025, 2030]})
predictions = regressor.predict(future_years)
# Predicted monthly injury collisions for 2025 and 2030
print("Predicted collision data for 2025:", predictions[0])
print("Predicted collision data for 2030:", predictions[1])
Predicted collision data for 2025: 424.85477941176487
Predicted collision data for 2030: 407.91544117647027
plt.scatter(X, y, color='blue', s=6)
plt.plot(X, regressor.predict(X), color='red', label='Linear Regression Trend')
plt.xlabel('Year')
plt.ylabel('Collision Value')
plt.title('Collision Data Trend (2005-2020)')
plt.legend()
plt.show()
df_county_collision.head(), df_county_pop.head()
| | STATISTIC | Statistic Label | TLIST(A1) | Year | C02451V02968 | County | UNIT | VALUE |
|---|---|---|---|---|---|---|---|---|
| 0 | ROA27C01 | Fatal Collisions | 2013 | 2013 | - | All Counties | Number | 179 |
| 1 | ROA27C01 | Fatal Collisions | 2013 | 2013 | 01 | Carlow | Number | 1 |
| 2 | ROA27C01 | Fatal Collisions | 2013 | 2013 | 02 | Dublin | Number | 18 |
| 3 | ROA27C01 | Fatal Collisions | 2013 | 2013 | 03 | Kildare | Number | 13 |
| 4 | ROA27C01 | Fatal Collisions | 2013 | 2013 | 04 | Kilkenny | Number | 4 |

| | County | Population(per 1,000) |
|---|---|---|
| 0 | Cavan | 76.2 |
| 1 | Donegal | 159.2 |
| 2 | Leitrim | 32 |
| 3 | Monaghan | 61.4 |
| 4 | Sligo | 65.5 |
# Create a copy of the original collision DataFrame and filter out the 'All Counties' rows to make plotting and merging easier
filtered_collision = df_county_collision.copy()
# Exclude rows where 'County' is 'All Counties' so national totals are not counted as a county
filtered_collision = filtered_collision[filtered_collision['County'] != 'All Counties']
# ensure 26 counties are there in each df and that they are labelled the same
len(filtered_collision.County.unique()), len(df_county_pop.County.unique())
(26, 26)
# ensure both DFs Counties match so can merge on 'County'
# Check if all county names in collision_data exist in population_data
all_counties_matched = all(filtered_collision['County'].isin(df_county_pop['County']))
# Check if all county names in population_data exist in collision_data
all_counties_matched_reverse = all(df_county_pop['County'].isin(filtered_collision['County']))
# Determine if all county names match in both directions
if all_counties_matched and all_counties_matched_reverse:
    print("All county names match in both data frames.")
else:
    print("County names do not match in both data frames.")
All county names match in both data frames.
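The check above only reports whether the names match. If they did not, it would help to see which names are the problem. A minimal sketch using set differences is shown below; the two small frames and the trailing-space mismatch are hypothetical and only illustrate the idea.

```python
import pandas as pd

# Hypothetical frames: one county name differs only by trailing whitespace
collisions = pd.DataFrame({"County": ["Carlow", "Dublin", "Cork "]})
population = pd.DataFrame({"County": ["Carlow", "Dublin", "Cork"]})

collision_names = set(collisions["County"])
population_names = set(population["County"])

# Names present in one frame but not the other
only_in_collisions = collision_names - population_names
only_in_population = population_names - collision_names
print(only_in_collisions, only_in_population)

# Stripping whitespace is one common repair before merging
collisions["County"] = collisions["County"].str.strip()
still_unmatched = set(collisions["County"]) ^ set(population["County"])
print(still_unmatched)
```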
# merge the collision data with the county population and create normalized value for collision
merged_county = filtered_collision.merge(df_county_pop, on = 'County')
# change population column to type float and then calculate normalized value
merged_county['Population(per 1,000)'] = merged_county['Population(per 1,000)'].str.replace(',', '', regex=True)
merged_county['Population(per 1,000)'] = merged_county['Population(per 1,000)'].astype(float)
merged_county['Normalized Value'] = merged_county['VALUE'] / merged_county['Population(per 1,000)']
merged_county.head()
| | STATISTIC | Statistic Label | TLIST(A1) | Year | C02451V02968 | County | UNIT | VALUE | Population(per 1,000) | Normalized Value |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ROA27C01 | Fatal Collisions | 2013 | 2013 | 01 | Carlow | Number | 1 | 56.9 | 0.017575 |
| 1 | ROA27C01 | Fatal Collisions | 2014 | 2014 | 01 | Carlow | Number | 5 | 56.9 | 0.087873 |
| 2 | ROA27C01 | Fatal Collisions | 2015 | 2015 | 01 | Carlow | Number | 4 | 56.9 | 0.070299 |
| 3 | ROA27C01 | Fatal Collisions | 2016 | 2016 | 01 | Carlow | Number | 0 | 56.9 | 0.000000 |
| 4 | ROA27C01 | Fatal Collisions | 2017 | 2017 | 01 | Carlow | Number | 3 | 56.9 | 0.052724 |
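As a worked check of the normalization, Carlow's 5 fatal collisions in 2014 divided by a population of 56.9 thousand gives about 0.0879 fatal collisions per 1,000 residents, matching the table. The sketch below reproduces that arithmetic and the string cleaning step. The "Example" row with the string "1,347.4" is hypothetical and only shows why the thousands separator has to be removed before converting to float.

```python
import pandas as pd

# Carlow rows reuse values from the table above; the "Example" row is hypothetical
toy = pd.DataFrame({
    "County": ["Carlow", "Carlow", "Example"],
    "VALUE": [1, 5, 18],
    "Population(per 1,000)": ["56.9", "56.9", "1,347.4"],
})

# Remove thousands separators as literal text, then convert to float
population = (
    toy["Population(per 1,000)"].str.replace(",", "", regex=False).astype(float)
)

# Collisions per 1,000 residents, and the same rate per 100,000 residents
toy["Normalized Value"] = toy["VALUE"] / population
toy["Per 100,000"] = toy["Normalized Value"] * 100
print(toy)
```

Rates per 100,000 residents are often easier to read than rates per 1,000 for rare events such as fatal collisions, although the ranking of counties is the same either way.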
# Create a plot looking at total number of fatal collisions, normalized per population, for each county
merged_county_fatal = merged_county[merged_county['Statistic Label'] == 'Fatal Collisions']
# group the data by County
group_by_county = merged_county_fatal.groupby('County')
# create a bar plot of population-normalized fatal collisions per county (yearly values stack into a total)
fig = px.bar(merged_county_fatal, x="County", y="Normalized Value",
color="County",
title="Total Normalized Fatal Collision Values by County (2013 - 2020)",
labels={"County": "County", "Normalized Value": "Total Normalized Collision Values"})
fig.update_xaxes(tickangle=-45) #rotate axes label and flip it
fig.show()
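The stacked bars show the totals visually, but the ranking itself can also be computed directly. A minimal sketch, assuming a frame shaped like `merged_county_fatal`, is to sum the normalized values per county and sort. The county names and numbers below are made up for illustration.

```python
import pandas as pd

# Hypothetical normalized values for three made-up counties over two years
toy = pd.DataFrame({
    "County": ["A", "A", "B", "B", "C", "C"],
    "Year": [2013, 2014] * 3,
    "Normalized Value": [0.02, 0.03, 0.10, 0.08, 0.01, 0.01],
})

# Total normalized fatal collisions per county, highest first
totals = (
    toy.groupby("County")["Normalized Value"]
    .sum()
    .sort_values(ascending=False)
)
print(totals)

highest, lowest = totals.index[0], totals.index[-1]
print(f"Highest: {highest}, lowest: {lowest}")
```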